texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window"
]
First, build the vocabulary:
vocab = sorted(set(word for sentence in texts for word in sentence.split()))
print(len(vocab), vocab)
==================================================================================
12 ['and', 'black', 'blue', 'car', 'crow', 'i', 'in', 'my', 'reflection', 'see', 'the', 'window']
Now vectorize the input text by marking which vocabulary words appear in it.
import numpy as np
def binary_transform(text):
    output = np.zeros(len(vocab))
    words = set(text.split())
    # if a vocabulary word appears in the text, set that position to 1
    for i, v in enumerate(vocab):
        output[i] = v in words
    return output
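To see the transform in action, the binary vectors for the three example texts can be stacked into a small document-term matrix (a minimal sketch; the variable name matrix below is illustrative):

# build a binary document-term matrix for the example texts
matrix = np.array([binary_transform(t) for t in texts])
print(matrix.shape)  # (3, 12): three documents, twelve vocabulary words
print(matrix[0])     # 1.0 at the positions of the words in "blue car and blue window"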
import pandas as pd
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.datasets import load_files
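The cells below reference a DataFrame df with text and label columns and a spaCy pipeline nlp, neither of which is shown in this excerpt. A minimal sketch of the assumed setup, using load_files on a local copy of the BBC news dataset (the directory path and the spaCy model name are assumptions):

import spacy

# assumed: one subdirectory per category (business/, sport/, ...); the path is hypothetical
bbc = load_files('bbc/', encoding='utf-8', decode_error='replace')
df = pd.DataFrame({'text': bbc.data, 'label': bbc.target})

# assumed: small English spaCy model for lemmatization and POS tagging
nlp = spacy.load('en_core_web_sm')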
def only_nouns(texts):
    output = []
    for doc in nlp.pipe(texts):
        # nouns carry the most signal for the model, so keep only NOUN tokens
        noun_text = ' '.join(token.lemma_ for token in doc if token.pos_ == 'NOUN')
        output.append(noun_text)
    return output

df['text'] = only_nouns(df['text'])
df.head()
=======================================================
                                                text  label
0  boss bag award executive business magazine tit...      0
1  copy bumper sale fi shooter game copy sale com...      4
2  msp climate warning climate change control dec...      2
3  pavey success view week race track bronze inju...      3
4  tory rethink association candidate election ag...      2
Training
n_topics = 5
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vec = TfidfVectorizer(max_features=5000, stop_words='english', max_df=0.85, min_df=2)
features = vec.fit_transform(df.text)
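The loop below reads from a fitted model cls and a constant n_top_words, neither of which appears in this excerpt. The five topics in the output suggest a matrix-factorization topic model; a minimal sketch assuming NMF:

from sklearn.decomposition import NMF

n_top_words = 15
# assumed: factorize the TF-IDF matrix into n_topics topics
cls = NMF(n_components=n_topics, random_state=0)
cls.fit(features)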
feature_names = vec.get_feature_names_out()

for i, topic_vec in enumerate(cls.components_):
    print(i, end=' ')
    # topic_vec.argsort() returns word indices ordered from lowest to highest score;
    # the [-1:-n_top_words:-1] slice walks back over the highest-scoring words
    for fid in topic_vec.argsort()[-1:-n_top_words:-1]:
        print(feature_names[fid], end=' ')
    print()
=============================================================
0 growth sale economy year company market share rate price firm profit oil analyst month
1 film award actor star actress director nomination movie year comedy role festival prize category
2 game player match team injury club time win season coach goal victory title champion
3 election party government tax minister leader people campaign chancellor plan issue voter country taxis
4 phone people music technology service user broadband software computer tv network device video site
Prediction
new_articles = [
    "Playstation network was down so many people were angry",
    "Germany scored 7 goals against Brazil in worldcup semi-finals"
]
# transform the new articles with the same vectorizer, then take the index
# of the highest-scoring topic for each article
cls.transform(vec.transform(new_articles)).argsort(axis=1)[:, -1]
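From here the post switches to a nearest-neighbour example over the 20 Newsgroups corpus, but the cell that loads bunch is not shown. The shape (11314, 10000) printed below matches the 20 Newsgroups training split, so a plausible sketch of the missing step is:

from sklearn.datasets import fetch_20newsgroups

# assumed: load the raw training split (11314 posts) of 20 Newsgroups
bunch = fetch_20newsgroups(subset='train')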
bunch.data[0]
====================================================================
'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n'
Extract features:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(max_features=10000)
features = vec.fit_transform(bunch.data)
print(features.shape)
================================================================
(11314, 10000)
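The next cell calls knn.kneighbors, but the fitting step is not in this excerpt. The distances in the output stay below 1, which is consistent with cosine distance; a minimal sketch assuming NearestNeighbors with the cosine metric:

from sklearn.neighbors import NearestNeighbors

# assumed: index the TF-IDF vectors for cosine-distance neighbour lookups
knn = NearestNeighbors(metric='cosine')
knn.fit(features)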
input_texts = ["any recommendations for good ftp sites?", "i need to clean my car"]
input_features = vec.transform(input_texts)
D, N = knn.kneighbors(input_features, n_neighbors=2, return_distance=True)
for input_text, distances, neighbors in zip(input_texts, D, N):
    print("Input text = ", input_text[:200], "\n")
    for dist, neighbor_idx in zip(distances, neighbors):
        print("Distance = ", dist, "Neighbor idx = ", neighbor_idx)
        print(bunch.data[neighbor_idx][:200])
        print("-" * 200)
    print("=" * 200)
    print()
==========================================================================
Input text =  any recommendations for good ftp sites?

Distance =  0.5870334253639387 Neighbor idx =  89
I would like to experiment with the INTEL 8051 family. Does anyone out there know of any good FTP sites that might have compiliers, assemblers, etc.?
--------------------------------------------------------------------------------
I am looking for ftp sites (where there are freewares or sharewares) for Mac. It will help a lot if there are driver source codes in those ftp sites. Any information is appreciated.

Thanks in
--------------------------------------------------------------------------------
================================================================================

Input text =  i need to clean my car

Distance =  0.6592186982514803 Neighbor idx =  8013
In article <49422@fibercom.COM> rrg@rtp.fibercom.com (Rhonda Gaines) writes: > >I'm planning on purchasing a new car and will be trading in my '90 >Mazda MX-6 DX. I've still got 2 more years to pay o
--------------------------------------------------------------------------------
Distance =  0.692693967282819 Neighbor idx =  7993
I bought a car with a defunct engine, to use for parts for my old but still running version of the same car. The car I bought has good tires. Is there anything in particular that I should do to stor
--------------------------------------------------------------------------------
================================================================================